Micro-kernels for portable and efficient matrix multiplication in deep learning

Authors

Abstract

We provide a practical demonstration that it is possible to systematically generate a variety of high-performance micro-kernels for the general matrix multiplication (gemm) via generic templates that can be easily customized to different processor architectures and micro-kernel dimensions. These micro-kernels employ vector intrinsics to exploit the SIMD (single instruction, multiple data) units in current general-purpose processors and, for the particular type of problems encountered in deep learning, deliver a floating-point throughput rate on par with, or even higher than, that obtained with conventional, carefully tuned implementations in linear algebra libraries (e.g., BLIS, AMD AOCL, ARMPL). Our work exposes the structure of the template-based micro-kernels for ARM Neon (128-bit SIMD), ARM SVE (variable-length SIMD), and Intel AVX512 (512-bit SIMD), showing considerable performance on an NVIDIA Carmel processor (ARM Neon), a Fujitsu A64FX processor (ARM SVE), and an AMD EPYC 7282 processor (256-bit SIMD).
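To make the template idea concrete, the following is a minimal sketch, assuming an 8x16 micro-kernel written with Intel AVX-512 intrinsics and the conventional packed micro-panel layout used by BLIS-style libraries; the function name and the MR/NR blocking factors are illustrative choices, not the paper's actual code.

```c
#include <immintrin.h>

#define MR 8    /* rows of the micro-tile                      */
#define NR 16   /* columns: one 512-bit register = 16 floats   */

/* C[MR x NR] += A[MR x kc] * B[kc x NR].
 * Ap holds kc packed column slivers of MR floats each;
 * Bp holds kc packed row slivers of NR floats each;
 * C is row-major with leading dimension ldc. */
static void gemm_ukernel_8x16(int kc, const float *Ap, const float *Bp,
                              float *C, int ldc)
{
    __m512 c[MR];
    for (int i = 0; i < MR; i++)                 /* load the C micro-tile  */
        c[i] = _mm512_loadu_ps(&C[i * ldc]);

    for (int p = 0; p < kc; p++) {               /* rank-1 update per step */
        __m512 b = _mm512_loadu_ps(&Bp[p * NR]); /* one packed row of B    */
        for (int i = 0; i < MR; i++) {
            __m512 a = _mm512_set1_ps(Ap[p * MR + i]); /* broadcast A(i,p) */
            c[i] = _mm512_fmadd_ps(a, b, c[i]);  /* fused multiply-add     */
        }
    }
    for (int i = 0; i < MR; i++)                 /* write the tile back    */
        _mm512_storeu_ps(&C[i * ldc], c[i]);
}
```

Retargeting such a template to ARM Neon or SVE amounts to swapping the vector type, width, and intrinsics while keeping the loop structure, which is exactly the kind of variation generic templates are meant to capture.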


Similar articles

On Composing Matrix Multiplication from Kernels

Matrix multiplication is often treated as a basic unit of computation in terms of which other operations are implemented, yielding high performance. In this paper, initial evidence is provided that there is a benefit to be gained when the lower-level kernels from which matrix multiplication is composed are exposed. In particular, it is shown that matrix multiplication itself can be coded at a high level ...
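As a hedged sketch of that composition, assuming a simple 4x4 scalar micro-kernel and matrix dimensions that are multiples of 4 (the names and blocking factors here are illustrative, not drawn from the paper), the high-level code reduces to loops around the exposed kernel:

```c
/* C[4x4] += A[4xk] * B[kx4]; all matrices row-major with leading dims. */
static void ukernel_4x4(int k, const float *A, int lda,
                        const float *B, int ldb, float *C, int ldc)
{
    for (int p = 0; p < k; p++)
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
}

/* GEMM composed from the micro-kernel; assumes m and n are multiples of 4.
 * Swapping in a vectorized micro-kernel changes nothing above this level. */
void gemm(int m, int n, int k, const float *A, int lda,
          const float *B, int ldb, float *C, int ldc)
{
    for (int i = 0; i < m; i += 4)
        for (int j = 0; j < n; j += 4)
            ukernel_4x4(k, &A[i * lda], lda, &B[j], ldb,
                        &C[i * ldc + j], ldc);
}
```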


Writing a performance-portable matrix multiplication

There are several frameworks that, while providing functional portability of code across different platforms, do not automatically provide performance portability. As a consequence, programmers have to hand-tune the kernel codes for each device. The Heterogeneous Programming Library (HPL) is one of these libraries, but it has the interesting feature that the kernel codes, which implement the co...


Efficient Matrix Multiplication in Hadoop

In a typical MapReduce job, each map task processes one piece of the input file. If two input matrices are stored in separate HDFS files, one map task would not be able to access both input matrices at the same time. To deal with this problem, we propose an efficient matrix multiplication scheme for Hadoop. For dense matrices, we use plain row-major order to store the matrices on HDFS; for sparse ma...


Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors

We introduce a variational Bayesian neural network where the parameters are governed via a probability distribution on random matrices. Specifically, we employ a matrix variate Gaussian (Gupta & Nagar, 1999) parameter posterior distribution where we explicitly model the covariance among the input and output dimensions of each layer. Furthermore, with approximate covariance matrices we can achie...
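For orientation, the matrix variate Gaussian invoked here (Gupta & Nagar, 1999) is, for an n x p weight matrix W, the distribution MN(M, U, V); a standard identity (stated here for reference, not taken from this paper) is that it is a multivariate Gaussian on the vectorized matrix with Kronecker-factored covariance:

```latex
p(W) = \frac{\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left[V^{-1}(W-M)^{\top}U^{-1}(W-M)\right]\right)}
            {(2\pi)^{np/2}\,|V|^{n/2}\,|U|^{p/2}},
\qquad
\mathrm{vec}(W) \sim \mathcal{N}\!\left(\mathrm{vec}(M),\; V \otimes U\right)
```

where U (n x n) captures covariance among the rows (input dimensions) and V (p x p) among the columns (output dimensions) of the layer.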


Implementing Efficient, Portable Computations for Machine Learning




Journal

Journal title: The Journal of Supercomputing

Year: 2022

ISSN: 0920-8542, 1573-0484

DOI: https://doi.org/10.1007/s11227-022-05003-3